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Abstract 

We introduce a new class of nonstationary kernels, which we derive as covariance 
functions of a novel family of stochastic processes we refer to as string Gaussian 
processes (string GPs). We construct string GPs to allow for multiple types of lo¬ 
cal patterns in the data, while ensuring a mild global regularity condition. In this 
paper, we illustrate the efficacy of the approach using synthetic data and demon¬ 
strate that the model outperforms competing approaches on well studied, real-life 
datasets that exhibit nonstationary features. 


1 Introduction 

Over the past two decades, the use of kernel^]has been at the heart of many endeavours in the statis¬ 
tics and machine learning communities to extract hidden structures from datasets. Kernels are often 
used as a flexible way of departing from linear hypotheses in learning machines, thereby allowing 
for more complex nonlinear patterns (TJ [21. They have indeed been successfully applied to prob¬ 
lems of classification, clustering, density estimation and regression. The duality between kernels 
and covariance functions has made kernels a critical tool for both frequentist and Bayesian statisti¬ 
cians. In the Bayesian nonparametrics community, kernels are often used as a covariance function 
of a Gaussian process (GP), introduced as a prior over a latent function. The family of covariance 
functions postulated for the GP is typically chosen so as to express prior domain knowledge about 
the underlying function, such as periodicity, regularity and range. The parameters of the kernel are 
then learned from the data. 

Related work 

When one is concerned with automatically uncovering structures from the data, a flexible family of 
kernels should be used that can account for intricate patterns. Expressive kernels have been proposed 
to this end. In Eia two families of kernels were proposed that are dense in the class of stationary 
kernels, which allows flexibly learning the shape of the kernel from the data. Their approaches are 
however limited in that the learned kernel can be regarded as a global signature for the training 
data, that accounts simultaneously for every pattern/structure evidenced in the training data. As 
a result, prediction using these approaches will result in global extrapolation of local patterns or 
structures; that is prediction with the learned kernel will aim at replicating a mixture of all local 
patterns evidenced in the training dataset without any concept of spatial relevance. This might not 
be appropriate if the data exhibit multiple types of local patterns. 

Varying patterns are often handled using nonstationary kernels. In El a method is proposed for 
constructing nonstationary covariance functions from any stationary one that involves introducing n 
input dependent d x d covariance matrices to be inferred from the data. Their method however scales 
poorly with data size and input dimension, and as noted by the authors ‘it shows little advantage over 
stationary GP on held out data’. Another popular approach for introducing nonstationary kernels 

1 We use the word kernel to denote any symmetric positive semi-definite function. 
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consists of postulating that stationarity holds in a new space resulting from a nonlinear deformation 
d of the original input space (ISO. 8]): k(x, x') = /( \\d(x) — d(V)||), where / is positive definite. 
Although this idea is appealing, it relies on appropriately setting the deformation beforehand. In 
0 the deformation d was learned in a fully nonparametric fashion, using a GP prior with identity 
mean and translation invariant covariance function. However, this effectively leads to a stationary 
kernel (see appendix). Another attempt to learn nonstationarity from the data in a fully Bayesian 
nonparametric way has been proposed by 0 . The authors considered functional priors of the form 
y(x ) = f(x) exp g(x) where / is a stationary GP and g is a scaling function. For a given non¬ 
constant function g such a prior yields a nonstationary Gaussian process. However, when a stationary 
GP prior is put on the function g as in 60 the resulting functional prior y(x) = f(x) exp g(x) 
becomes stationary, with a covariance function that can be determined analytically (see appendix). 

Other authors have introduced nonstationarity by noting that in most cases, heterogeneous patterns 
in the data are locally homogeneous. For example, nni partitioned the domain using Voronoi tes¬ 
sellations and considered independent stationary GPs defined on elements of the partition and m 
used tree based partitioning and postulated independent stationary GPs on the leaves. The two meth¬ 
ods can also be regarded as mixtures of GP experts (| 12] 13 , i 14, 15 ]) on non-overlapping domains. 
This approach has been shown to be very flexible, and is truly nonstationary. It also scales consid¬ 
erably better than alternative approaches such as a, as it has complexity 0(^2 k n k ) —which is 
better than the typical GP complexity 0((^2 k rik ) 3 )— and it introduces fewer additional parame¬ 
ters. However, these methods present two major pitfalls. Firstly, the independence between experts 
results in higher posterior uncertainty, as individual experts have fewer data to learn from. More 
importantly, the resulting functional priors are discontinuous at the boundary of the partition, which 
might not be appropriate if the latent function to be inferred is known to be smooth. 

The approach we present in this paper improves on the above work in that we construct a new 
stochastic process resulting from a collaboration between local GP experts on non-overlapping do¬ 
mains. Each expert is driven by its own arbitrarily flexible kernel, and shares boundary conditions 
with neighbours, conditional on which they are independent. This offers principled nonstationary 
construction from existing kernels, while ensuring the functional prior is continuously differentiable. 
The rest of the paper is structured as follows. In section [2] we construct string Gaussian processes 
indexed on M, we derive their covariance functions and provide multivariate extensions. In section 
[3j we illustrate the necessity for our approach on synthetic data and demonstrate that it outperforms 
alternative models on well studied real-life datasets. We conclude with a discussion in section [4] 

2 String GP Kernels 

In this section we construct a new class of stochastic processes whose covariance functions will give 
rise to string GP kernels. First we consider the joint between a GP and its derivative (when it exists). 

Proposition 1 Let I be an interval, k : / x I —>> M a C 2 symmetric positive definite func¬ 
tion n m : I —)> M a C 1 function. There exists a M 2 valued stochastic process (D t ) teI , D t = 
(zt, z[), such that for all t\, ... ,t n G I, (z tl ,..., z tn , z' ti ,..., z[ ) is a Gaussian vector with 
mean (m(G),..., m(t n ),^(ti),..., yyfifn)) an d covariance matrix such that cov(z ti , z tj ) = 
cov(z ti , z[.) = || ) and cov(z' t ., z [.) = We herein refer to (D t ) teI as 

a derivative Gaussian process (DGP). Moreover, ( z t )tei ^ a GP with mean function m, covariance 
function k and that is C 1 in the L 2 (mean square) sense. Furthermore, (z' t ) te i is a GP with mean 

function ^ and covariance function Q x Q y , and ( z' t ) t ei is the L 2 derivative of the process ( z t )tei • 
Proof: see appendix. 

We consider a kernel k as degenerate at a when a DGP (z t , z t)tei with kernel k is such that z a 
and z' a are perfectly correlate^] \corr(z a , z' a )\ = 1. As an example, the linear kernel k (u,v) = 
a 2 (u — c)(v — c ) is degenerate at 0. Moreover, we will consider a kernel k as degenerate at b given 
a when it is not degenerate at a and when a DGP (z t , z' t ) te i with kernel k is such that the variances 

2 C 1 (resp. C 2 ) functions denote functions that are once (resp. twice) continuously differentiable on their 
domains. 

3 Or equivalently when the Gaussian vector (z a , z' a ) is degenerate. 
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of Zb and z' b conditional on ( z a , z' a ) are both zercQ For instance, the periodic kernel proposed by 
ED with period T is degenerate at u + T given u. 


An important subclass of DGPs in our construction are the processes resulting from conditioning 
paths of a DGP to take specific values at certain times (t i,..., t c ). We herein refer to those processes 
as conditional derivative Gaussian processes (CDGP). As an illustration, when k is C 3 on I x / 
with I = [a, b\, and neither degenerate at a nor degenerate at b given a, the CDGP on I = [a, b] with 
unconditional mean function m and unconditional covariance function k that is conditioned to start 
at (z a , z' a ) an d to end at (£&, z' b ), is the DGP whose mean and covariance functions read: 


m“’ 6 (i) = m(f) + K t;(a)6) K 


-1 

( a,b)-,(a,b ) 


[z a — m(a) z' a — ^jr(a) z b - m(b) z' b - ^ 


K (f, s) — k(f, s) K t; ( ai b)K^ a b y^ a j6 )K s .( a)6 ), 


where K,.„. = 


;a = [k(f,a) f (i,a)] , k t;(a , b) = [K t;a K t;b ] , 


^u;v 


k(u,v) f£(«,*>) 


(«,u) 


<9 2 k 

dxdy 


(u,v) 


and K( aj 5).( aj 5) 




It is worth noting that both K a;a and K ( a ,b);(a,b) are indeed 


invertible because the kernel is assumed to be neither degenerate at a nor degenerate at b given a. 
Hence, the support of (z a , z' a ,Zb , z b ) is M 4 , and any values can be used for conditioning. 


String Gaussian processes 


The following theorem, at the core of our framework, establishes that it is possible to connect to¬ 
gether GPs on a partition of an interval /, in a manner consistent enough that the newly constructed 
stochastic object will be a stochastic process on I and in a manner restrictive enough that any two 
connected GPs will share just enough information to ensure that the constructed stochastic process 
is continuously differentiable (C 1 ) on I in the L 2 sense. 


Theorem 2 Let ao < ••• < clk, I = [clo,uk\ and let pj\f(x; /i, S) be the multivariate Gaus¬ 
sian density with mean vector p and covariance matrix E. Furthermore, let ( m k : [ak-i,a,k\ 
U^)fce[i..K] be C 1 functions, and (Jc k : [ak-i,a,k\ x [ak-i,ak\ ®0/ee[i..x] be C 3 symmetric positive 
definite functions, neither degenerate at a k -\, nor degenerate at ak given ak- 1- 

(A) There exists an M 2 valued stochastic process ( SD t ) te i , SD t = (z t ,z' t ) such that: 

1) The probability density of (SD ao ,...,SD aK ) reads: p b (x 0 ,.. .,x K ) := Y[ k =oP^( x k', dk^k)- 


where: 


4 = 1 *a 


V k > 0 ££ = 


JZ _ JZ F-l TZl 

k ak ',cbk k ak',ak — i k a,k—i',CLk—i k a>k',CLk — i 5 


(i) 


do — lM ao , v k > 0 Pf. — feM afc + kKa k ;a k -i k^a k -i',a k - 1 ( x k-l kM ak _ 1 ), 


( 2 ) 


with i,K„ 


k k (u,v) 


(u,v) 


dfkk 

dxdy 


and k M v 


m k (u) 

^(») 


2) Conditional on (SD ak = £/c)/ce[o..iV]> ^ restrictions (ST> t ) te ] afc _ 1?afc [, k G are mJe- 

pendent CDGPs, respectively with unconditional mean function rrik and unconditional covariance 
function kk and that are conditioned to take values Xk-i and Xk at ak -1 owd ak respectively. We 
refer to (SD t ) te j as a string derivative Gaussian process (string DGP), and to its first coordinate 
(z t )tei as a string Gaussian process (string GP) namely, 


( z t )tei ~ SQV({a k }, { m k }, {k k }). 

(B) The string Gaussian process ( zfjtei defined in (A) is a Gaussian process that is C 1 in the L 2 
sense and its L 2 derivative is the process (z' t ) t ei defined in (A). Proof: see appendix. 


In our collaborative local GP experts analogy, Theorem [2] stipulates that each local expert (string) 
takes as message from the previous expert its left hand side boundary conditions, conditional on 
which it generates its right hand side boundary conditions, which it then passes on to the next 

4 That is when the Gaussian vector (z a , z a ) is not degenerate but (z a , z' a ,z b , z' b ) is. 
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expert. Conditional on their boundary conditions local experts are independent of each other, and 
resemble vibrating pieces of string on fixed extremities, hence the name string Gaussian process. 
Figure |TJa) illustrates 3 independent draws from a CDGP conditioned to start at 0 with derivative 
0 and to finish at 1 with derivative 0. The unconditional GP used was stationary with mean 0 and 
squared exponential covariance function with variance 1 and input scale 0.2. Figure |TJb) illustrates 
a draw from a string GP (z t ) with 3 strings and its derivative (z' t ), under squared exponential kernels 
(green and yellow strings), and the periodic kernel of m (red string). 



Univariate String GP Kernels 


We call string GP kernel the covariance function of a string GP. As we do not impose any restriction 
on the functional form of the unconditional kernels driving strings, it is not possible to derive a 
functional form for string GP kernels , but they are readily available from unconditional kernels. We 
now derive how to evaluate the joint covariance structure of a string GP and its derivative, 


K 7/ 


CO v(z u ,z v ) 
cov(z' u ,z v ) 


CO v(z u . 
CO v(z' u . 


K) 

z v) 


ksGP<,u,v ) 

dksGP -(u,v) 


du 


dksc 

dv 
d ksGP 

dudv 


(u,v 
(u,v)_ 


= K T 


which corresponds to string GP kernels and their partial derivatives. To do so, we need the following 
lemma: 


Lemma 3 Let X be a multivariate Gaussian with mean px and covariance matrix Ex- If condi¬ 
tional on X, Y is a multivariate Gaussian with mean MX + A and covariance matrix E y where 
M, A and E y do not depend on X, then (A, Y) is a jointly Gaussian vector 


with mean px-Y 


Px 

Mp x + A 


and covariance matrix E x-y 


' S x M t 

MS* + MY X M T 


Proof: see appendix. 


To derive K u - V , we start by noting that the covariance function of a string GP is the same as that 
of another string GP whose strings have the same unconditional kernels but unconditional mean 
functions m & = 0, so that to evaluate string GP kernels we may assume that Vfc, = 0 without 
loss of generality. We deal with the case where u and v are both boundary times, after which we 
will generalise to other times. 


Boundary times : We note from Theorem [2] that the restriction (zt, z't)te[a 0 ,a i] is the DGP with mean 
0 and covariance function ki. Thus, (z ao , z ' ao , z ai , z' ai ) is jointly Gaussian, and if we let &K u;v be 
as in Theorem [2] 


K„ 


= iK a 


K ai;ai — iK ( 


ai ;ai 5 


K„ 


= iK a 


( 3 ) 


We recall that conditional on the boundary conditions at or prior to ak-i, ( z 0k . z ' Jk ) is Gaussian 
with mean b k M [z„ h _, where b k M = k K ak . ak _ 1 feK^ 1 _ i;a) ,_ i , and covariance matrix 

feE = kY, h :a k ~ b k M feKa fc _ i; a fe • Hence using Lemma[3]with M = [ k M 0 ... 0] where there 

are (k — 1) null block 2x2 matrices, and noting that ( z ao , z ' aQ ,..., z CLk _ 1 , z' CLk _ i ) is jointly Gaussian, 
it follows that (z ao , z ' aQ ,..., z ak , z ' ak ) is jointly Gaussian, that (z ak ,z' ak ) has covariance matrix 


_ b x 

-* v O'/ e ;a/ c k. 


IM IM 1 


( 4 ) 
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and that the covariance matrix between the boundary conditions at a k and at any earlier boundary 
time a/, l < k reads: 


K afe ;a z — k MKa k ^ t -ar 


(5) 


String times : Let u G [a p _i, a p ], t; G [a g _i, a g ]. By the law of total expectation, we have that 


— F 


[*» 41 1 =E| E 




®(p, q) , 


where B{p, q) refers to the boundary conditions at the boundaries of thep-th and q-th strings, in other 
words | 'z x ,z' x , x G { a p _-|, a p , a q _, a q } |. Furthermore, using the definition of the covariance 


matrix under the conditional law, it follows that 


B(p, q) = cK m; „ + E 


B(p,q))E[[z v 41 


B(p, q) 


( 6 ) 


where C K u . v refers to the covariance matrix between (z u ,z' u ) and (z v , z ' v ) conditional on the bound¬ 
ary conditions B(p , g) and can be easily evaluated from Theorem [5] In particular, 


where 


if P 7^ (7 5 c^u-,v — 0? an d if P — g, c-^-u;v — 

Lir 


lfT 

K T 

P^Vldp 


(7) 


l^ai\ai -1 


-1 


We also note that 


S(p, g) = P A 


%-i 

z' 

Q-p— 1 

Z n - 


and E^ [z v z’ v ] B{p,q)^j = [z aq _ 1 4,_x z a q z a q ] qK 


Hence, taking the expectation with respect to the boundary conditions on both sides of Equation ([6]), 
we obtain: 


V U G [<2p_i,(2p], V G i 5 ttg] 5 — qK. u - v H- p A v 


^‘0'p—l]CLq — l ljQ.q 

^^■CLp]CLq— l -^ClpjClq, 


a t 


( 8 ) 


where c K n;v is provided in Equation (|7|. 


Multivariate String GP Kernels 

Univariate string GP kernels may be combined for multivariate tasks following approaches such as 
hierarchical kernel learning f lfTT! ) or the additive kernels of fl8l . Both approaches yield vastly more 
flexible kernels than multivariate kernels that use a specific norm (Euclidean or Mahalanobis) in the 
input space. Moreover, the special case where a product of univariate string GP kernels is considered 
extends the Automatic Relevance Determination model of d. In effect, not only can different input 
dimensions have different hyper-parameters, but each string in every input dimension may have its 
own hyper-parameters, allowing for Automatic Local Relevance Determination (ALRD). 


3 Experiments 


Synthetic data: In our first experiment, we highlight the limitation of standard GPs, in which a 
global covariance structure is postulated on the domain. This approach can result in unwanted global 
extrapolation of local patterns. We show that this limitation is addressed by string GP kernels. We 
use 2 toy regression problems with the following functions: 



sin(607r£) 

sin(167r£) 


t G [0,0.5] 
t G]0.5,1] 



sin(167r£) 

\ sin(327r£) 
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t G [0,0.5] 
t g]0.5, 1] 


(9) 





























/o (resp. /i) undergoes a sharp (resp. mild) change in frequency and amplitude at t = 0.5. We con¬ 
sider using their restrictions to [0.25, 0.75] for training. We sample those restrictions with frequency 
300, and we would like to extrapolate the functions to the rest of their domain using GP regression 
(CD). We compare string GP kernels with string boundary times {0, 0.5,1} to popular and expres¬ 
sive kernels. We use as unconditional kernels for strings the periodic kernel of osa (String Periodic) 
and a spectral mixture kernel with a single spectral component per string (String Spec. Mix.). Figure 
[2]illustrates plots of the posterior means for each kernel used, and Table[l]compares predictive errors. 

Overall, it can be noted that string GP kernels 
outperform competing kernels, including the 
expressive spectral mixture kernel &sm(0 = 

Ea=l W a ex P(- 27r2r2<J a) cos ( 27rr Ma) of 0) 

with A = 10 mixture components]^] The com- 
parison between the spectral mixture kernel and 
the string spectral mixture kernel is of particu¬ 
lar interest, since spectral mixture kernels are 
dense in the family of stationary kernels, and 
thus can be regarded as a general purpose fam¬ 
ily to learn stationary kernels from the data. In 
our experiment, the string spectral mixture ker¬ 
nel with a single mixture component per string significantly outperforms the spectral mixture kernel 
with 10 mixture components. This intuitively can be attributed to the fact that, regardless of the 
number of mixture components in the spectral mixture kernel, the learned kernel must account for 
both types of patterns present in each training dataset. Hence, each local extrapolation on each side 
of 0.5 will attempt to make use of a mixture of both amplitudes and both frequencies evidenced in 
the training datasets, and will struggle to recover the true local sine function. It should be expected 
that the performance of the spectral mixture kernel in this experiment will not improve drastically 
as the number of mixture components increases. However, under a string GP prior, the left and right 
hand side strings are independent conditional on the (unknown) boundary conditions. Therefore, 
the training dataset on [0.25, 0.5] influences the hyper-parameters of the string to the right of 0.5 
only to the extent that both strings should agree on the value of the latent function and its deriva¬ 
tive at 0.5. To see why this is a weaker condition, we consider the family of pair of functions: 
(auji sin(u; 2 t), ^2 sin(ccit)), Ui = ki, ki £ N, a G M. Such functions always have the same 
value and derivative at 0.5, regardless of their frequencies, they are plausible GP paths under a spec¬ 
tral mixture kernel with one single mixture component (fi a = ki and cr a <C 1), and can be exactly 
reproduced with the periodic kernel of EE As such, it is not surprising that extrapolation under a 
string spectral mixture kernel should perform well, and that a string periodic kernel should recover 
the function almost perfectly. 





MAE 



Kernel 

/o 



h 



Squared Exponential 

0.89 

zb 

2.01 

0.48 

zb 

0.58 

Rational Quadratic 

0.71 

zb 

2.18 

0.27 

zb 

0.81 

Matern 3/2 

0.81 

zb 

2.37 

0.55 

zb 

1.45 

Spec. Mix. (10 comp.) 

0.72 

zb 

1.12 

0.20 

zb 

0.30 

String Spec. Mix. 

0.23 

zb 

0.84 

0.06 

zb 

0.21 

String Periodic 

0.12 

zb 

0.14 

0.09 

zb 

0.11 


Table 1 : Predictive mean absolute error d= 2 std on 
the GP extrapolations of /o and /i. 



Motorcycle data: The objective of this experiment is three-fold. Firstly, we show on the well stud¬ 
ied motorcycle dataset of f20l that using guestimates for the partition of the domain, along with 
string GPs, outperforms the standard GP model. Secondly, we show that our approach considerably 
outperforms the alternatives proposed by ['10, Till , that consist of dividing the training dataset in 
disjoint subsets and performing independent training; this confirms the importance of the collab¬ 
oration between experts. Thirdly we illustrate learning the derivative of the function from noisy 


5 The sparse spectrum kernel of 0 can be thought of as the special case cr a <C 1. 
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String GP (6 strings) 


String GP (6 strings) 



Time (ms) 



Time (ms) 


Figure 3: Posterior mean ± 2 predictive standard deviations on the motorcycle dataset ( |j20l ), under 
a Matem 3/2 derivative string GP prior with 6 equal length string intervals. 


measurements. The observations consist of accelerometer readings taken through time in an exper¬ 
iment on the efficacy of crash helmets. We consider inferring the latent acceleration curve using 
GP regression, with an heteroskedastic Gaussian noise model, allowing the noise variance to be 
constant within a string interval and different between string intervals. We consider the two sets of 
boundary times {0,15, 30,45, 60} and {0,10, 20, 30,40, 50, 60}. We ran 50 independent random 
experiments, leaving out 5 points selected uniformly at random from the dataset for prediction, the 
rest being used for training. The models we compared include a vanilla GP, for each of the above 
set of boundary times the corresponding string GP, and mixtures of independent GP experts. The 
Matern 3/2 kernel was used as base throughout. The results are reported in Table [2] 



Training 


Prediction 


Log lik. 

Log lik. 

MAE 

Avg std 

RR std 

Vanilla GP 

-420.80 

-22.71 

16.24 

3.10 

0.23 

String GP (4 strings) 

-376.51 

-20.54 

16.21 

2.02 

1.26 

String GP (6 strings) 

-374.62 

-20.58 

15.83 

2.14 

1.03 

Mix. of 4 GPs 

-367.23 

-29.31 

20.13 

7.58 

1.86 

Mix. of 6 GPs 

-456.04 

-39.11 

28.86 

17.53 

1.12 

Regul. Mix. of 4 GPs 

-314.55 

-22.80 

17.41 

4.27 

1.34 

Regul. Mix. of 6 GPs 

-273.68 

-22.73 

18.32 

3.78 

1.25 


Table 2: Performance comparison on the motorcycle dataset ((2b])- RR 
std is the average relative range (i.e. (max - min)/avg) of the posterior 
std. 


It can be seen from Table 
[2] that string GPs outper¬ 
form the vanilla GP and 
mixtures of GPs in terms 
of average out-of-sample 
predictive log likelihood 
of the (noisy) left out ac¬ 
celerations, regardless of 
the set of boundary times 
used. More importantly, 
we note that this also held 
true individually for all 
of the 50 random experi¬ 
ments we ran. Moreover, 


string GPs yielded more certain predictions (Avg std) for the values of the latent function at test 
inputs than competing models. Furthermore, the heteroskedastic noise model, which fits naturally 
within the string GP framework, allows a wider range of posterior standard deviation (RR std) than 
the vanilla GP. Predictive certainty is indeed higher under string GP priors at the beginning and end 
of the experiment, which is intuitive as the magnitude of the noise in those parts is smaller. As 
for mixtures of independent GPs ([10, IF]), they significantly and consistently underperform both 
vanilla and string GPs out-of-sample, primarily because the latent acceleration is expected to be 
continuous. We added an L 2 penalty on the input length scale to the marginal likelihood in order to 
mitigate the discontinuity of the learned posterior means. Although the results are improved, their 
predictive performances remain worse than string GPs’, and they are prone to over-fitting. We also 
learned the derivative of the latent acceleration with respect to time (Figure [3] right), purely from 
noisy acceleration measurements using the joint law of a string GP and its derivative (Theorem [2]). 


Global air temperature anomalies: Next, we illustrate Automatic Local Relevance Determina¬ 
tion by comparing popular ARD kernels with products of univariate string GP kernels. We also 
illustrate that the gradient of the latent function can be learned jointly with the latent function (see 
appendix). The experiment is based on the well studied temperature anomalies dataset of ED. The 
dataset consists of monthly readings of air temperature anomalies at various points on the globe 
in December 1993. The authors defined air temperature anomaly as the deviation of a monthly 
temperature at a given location from the average over the period 1950-1979 of the monthly temper- 
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Training 

Prediction 


Log lik. 

Log lik. 

MAE 

Avg std 

ARD Poly 

-452.32 

-39.06 

0.61 

0.62 

ALRD Poly 

-423.68 

-38.80 

0.65 

0.56 

ARD SE 

-452.32 

-39.06 

0.61 

0.61 

ALRD SE 

-423.24 

-38.64 

0.66 

0.56 

ARD RQ 

-452.17 

-38.37 

0.59 

0.60 

ALRD RQ 

-426.90 

-28.88 

0.44 

0.25 

ARD MA 3/2 

-452.37 

-33.49 

0.50 

0.50 

ALRD MA 3/2 

-423.14 

-33.09 

0.55 

0.40 

ARD MA 5/2 

-452.09 

-36.24 

0.55 

0.52 

ALRD MA 5/2 

-419.62 

-35.28 

0.59 

0.44 


Table 3: Automatic Local Relevance Determi- 



Figure 4: Posterior mean and gradient 
vector field of the global air temperature 
anomalies of (Til (°C) in December 1993, 
under a string GP Matern 3/2 kernel. 


nation using string GP kernels on the global air 
temperature anomalies dataset. 


atures at the same location. There were 445 readings in December 1993. We ran 50 experiments, 
each using 50 points selected uniformly at random for testing and using the rest for training. We 
considered several popular kernels; a product of univariate second order polynomial kernels (i.e. 
fc(x, y) = a 2 (xy + c) 2 , cr, c > 0), the ARD squared exponential kernel, the ARD rational quadratic 
kernel, and the ARD Matern 3/2 and 5/2 kernels. We compared each of the above kernels with a 
product of univariate string GP kernels that has unconditional kernels in the same family and with 
boundary times {—90, 0, 90} in the latitude dimension and {0,180,360} in the longitude dimension. 
Average training and predictive results are summarized in Table [3] We note that string GP kernels 
(ARLD) have higher predictive log likelihood and yield more certain predictions than their ‘non- 
string’ counterparts, despite the simplicity of our choice of boundary times. Although ARD kernels 
sometimes yielded lower mean absolute errors, the difference in those cases was never in excess of 
0.05°C, which corresponds to just 2.5% of the sample standard deviation of anomalies (1.99 °C) 
and about 6% of the learned noise standard deviation (ranging from 0.82°C to 0.98°C). Figure [ 5 ] 
illustrates a map of the globe with the posterior mean of the latent anomalies and the posterior mean 
gradient vector field. 


4 Discussion 

In this paper we introduce a new class of nonstationary kernels we refer to as string GP kernels , 
that allow learning heterogeneous local patterns in the data, while ensuring global smoothness. We 
demonstrate the need for our approach on synthetic data, and we show that our approach outperforms 
competing alternatives on well studied real-life regression problems. 

Domain partition: As illustrated by the last two experiments, in many applications, naively par¬ 
titioning the domain might considerably improve on stationary kernels. The accuracy may be fur¬ 
ther improved by considering uniform partitions of the domain and learning the partition size by 
cross-validation using predictive likelihood or root mean square error as out-of-sample performance 
metric. Alternatively, the number and positions of boundary times may be jointly learned in a fully 
Bayesian setting by putting a point process prior on boundary times, and sampling from both the 
posterior intensity and the positions of the points using MCMC. 

Future work: Although the focus of this paper has been on improving predictive accuracy, the con¬ 
ditional independence between local experts may be exploited to develop exact inference methods 
that scale an order of magnitude better than the standard approach. 
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Supplementary Material for String GP Kernels 


Appendix A Additional Results 

Appendix A.l Motorcycle dataset 

Figure [5]illustrates posterior means and d= 2 standard deviations confidence bands for the motorcycle 
experiment. 

Appendix A.2 Additional Experiments 

In the first experiment we have demonstrated that string GP kernels outperform competing kernels 
on univariate extrapolation tasks in the presence of local patterns. This additional experiment aims 
at reinforcing this message by considering two bivariate interpolation problems exhibiting local 
patterns, and with inputs forming a grid. We introduce two functions f 2 and 

Vu,v e [0,1], f 2 (u,v) = fo(u)fi(v) and f 3 (u,v) = \Jf${u) + fi(v), (10) 

where /o and /i are the functions of our univariate synthetic experiment. For training, we use the 
restrictions of f 2 and to [0, 0.4] U [0.6,1] x [0, 0.4] U [0.6,1], which we evaluate on a uniform 
grid with mesh size 1/300 (72000 training points). We consider recovering the two functions on 
[0,1] x [0,1] using GP regression with i.i.d. Gaussian noise terms (90000 test inputs), and compare 
a product of string GP kernels with competing alternatives. We refer to (4|, for tricks to leverage the 
Kronecker structure of the resulting covariance matrix to yield inference in 0(N 2 ) computational 
time and O(N) memory requirement. We compare a product of string periodic kernels (String 
Per.), a product of periodic kernels introduced by m (Per.), a product of squared exponential 
kernels (SE), a product of Matern 3/2 kernels (MA 32), a product of rational quadratic kernels (RQ), 
and a product of spectral mixture kernels (SM). We use as boundary times for string GP kernels 
{0,0.5,1}. Figures [6] and J7] illustrate the ground truth (top left corner), the training data (top right 
corner), along with posterior means under the aforementioned kernels on f 2 and fo respectively. It 
can be seen that a product of string periodic kernels considerably outperforms alternatives in both 
cases. Competing kernels tend to fuse local patterns in the interpolation, whereas string GP kernels 
recover the underlying functions almost perfectly. Interestingly, the product of string GP kernels 
almost perfectly recovers fa, despite it not being a separable function. 

Appendix B Proofs 

Appendix B.l Proof that the fully nonparametric model of (U yields a stationary prior 

We want to prove that if d is a Gaussian process with mean the identity function and covariance 
function invariant by translation, and if conditional on d (z x ) is a GP with mean function that does 
not depend on d and conditional covariance function k(x,x f ) = /( \\d(x) — d(x')||) where / is 
positive definite, then (z x ) has (unconditional) covariance function invariant by translation. 

Firstly, we note that d can be rewritten as d(x) = x + d c {x) where d c is a stationary GP with mean 
0. Hence, the conditional covariance can be rewritten as 

CO v(z x ,z x+h I d) :=k (x,x + h) = /(|| -h + d c (x) - d c (x + h) ||). 

By law of total covariance, 

co v(z x ,z x+h ) = E(/(|| -h + d c (x) - d c (x + h )||)) +cov(E(z x | d),E (z x+h | d)). 

The first expectation is with respect to the law of d c {x) — d c (x + h), which is the same as the law of 
d c (0) — d c (h ) by stationary of d c . Hence, E(/(|| — h + d c (x) — d c (x + h) ||)) does not depend on x. 
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Vanilla GP 



String GP (4 strings) 



Time (ms) 
Mix. 4 GPs 



Time (ms) 
Mix. 6 GPs 



Time (ms) 


String GP (6 strings) 



Mix. 4 GPs Penal. 



Mix. 6 GPs Penal. 



Figure 5: Posterior mean ± 2 predictive std on the motorcycle dataset 120) under a vanilla GP, string 
GPs, and mixtures of GP experts (with and without prior on the input scale). 


Moreover, as the conditional mean E (z x | d) does not depend on d, cov(E(z x | d), E ( z x+h | d)) — 0, 
which concludes the proof. 


Appendix B.2 Proof that the fully nonparametric GPPM model of |||| yields a stationary 
prior 

We prove that if / and g are two independent mean zero stationary GPs, then y(x ) = f(x ) exp(g(x)) 
is also stationary, and we derive its covariance function. 
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Figure 6: Comparison of kernels in the GP interpolation of / 2 . 


Firstly, we recall that covariance stationarity of Gaussian processes implies strong stationarity. 
Moreover, strong stationarity of g implies strong stationarity of exp (g). Finally, as the product 
of strongly stationary processes is strongly stationary, y is also strongly stationary, which implies it 
is also covariance stationary. 
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Figure 7: Comparison of kernels in the GP interpolation of f%. 


As for the covariance function of y, we note that y is also mean zero as 

E (y(x)) l~E(f(x)exp(g(x))) = E(f(x))E(exp(g(x))) = 0, 
where the second equality results from the independence between / and g. 
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Hence, 


CO v(y(u),y(v)) =E(y(u)y(v)) 

= E(f(u)f(v))E( exp(g(u) + g(v))) 

= kf(u — v) exp (fe 9 (0) + k g (u — v)), (11) 

where kf(u — v), k g (u — v) are the covariance functions of / and g. The second equality results 
from the independence between / and g and the last equality follows from the moment generating 
function of a Gaussian. 

Appendix B.3 Proof of the existence of derivative Gaussian processes 

See Appendix B of our string Gaussian processes paper. 

Appendix B.4 Proof of the existence of string Gaussian processes 

See Appendix C of our string Gaussian processes paper. 

Appendix B.5 Proof of the lemma of the paper 

See Appendix G of our string Gaussian processes paper. 


Appendix C Joint inference of a latent function and its gradient under a GP 
prior 


Let us consider the following GP regression problem: 

\/x € y(x) = f(x) + e(x), f ~ QV(0, k (.,.)), e(x)J\f(0,a 2 ). 

It can be shown in a similar fashion to the proof of the existence of derivative Gaussian processes that 
if k is C 2 , then / is C 1 in the L 2 sense and (/, V/) is a M d+1 valued Gaussian process. Moreover, 
we have: 

CO <v(y(u), = cov(/(u), (12) 


cov (S" (w), ^~ (v) ) = 


dxo 


dujdvi 


(13) 


where §^-{v) denotes the partial derivative of the latent function with respect to the j-th input 
dimension at v, J^;(u, v) denotes the partial derivative of the covariance function with respect to 

the j-th coordinate of the second input evaluated at (u, v), and q!^.q , (u, v ) denotes the second order 
partial derivative of the covariance function with respect to the i- th coordinate of the first input and 
the j-th coordinate of the second input evaluated at (u,v). 


When k is separable (as in the global air temperature anomalies) and k(u,v) = Ylj=i kj( u ji v j 
follows that 

r)k f)k • 

o-( u ^) = -rrr( u j^j)Yl k i(u h vi), 


)’ it 




r) 2 k r)k■ r)k • 

3 3 mi,3} 


and 


d 2 k 

duodv 


(u,v) 


duidv a 


(ui,Vi)Y[ki(ui,vi). 


We recall that when all univariate kernels in the product are string GP kernels , ^ (uj ,Vj), 


^(ui,Vi) and 


d z kj 
dundv , 


■(ui, Vi) are elements of the matrix K u . v derived in the paper. 
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Overall, the posterior distribution of the gradients at test points given noisy measurments of the 
latent function at training points is Gaussian and its mean and covariance matrix are obtained using 
standard Gaussian identities (da) and the joint covariance structure we just derived. 
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